Problem Statement: Our Marketing team had a successful campaign in St. Petersburg, my aim with this project is to find ten such similar regions.
I will be using unsupervised learning to find cluster of Russian alcohol consumption pattern over the year. Once the clusters are identified, I will select ten regions from the cluster to which St. Petersburg belong.
The marketing team had sourced the historical sales volumes per capita for several different drinks types.
Highlights of the steps involved:
Based on the data model and analysis, following are the 10 regions for the marketing team to start their promotion
#pip install missingno
import pandas as pd
import missingno as msno
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from scipy.cluster.hierarchy import dendrogram, linkage
from sklearn.decomposition import PCA
import numpy as np
df = pd.read_csv(r'./data/russian_alcohol_consumption.csv')
df.head()
year | region | wine | beer | vodka | champagne | brandy | |
---|---|---|---|---|---|---|---|
0 | 1998 | Republic of Adygea | 1.9 | 8.8 | 3.4 | 0.3 | 0.1 |
1 | 1998 | Altai Krai | 3.3 | 19.2 | 11.3 | 1.1 | 0.1 |
2 | 1998 | Amur Oblast | 2.1 | 21.2 | 17.3 | 0.7 | 0.4 |
3 | 1998 | Arkhangelsk Oblast | 4.3 | 10.6 | 11.7 | 0.4 | 0.3 |
4 | 1998 | Astrakhan Oblast | 2.9 | 18.0 | 9.5 | 0.8 | 0.2 |
Checking data for it's integrety and quality, which would involve checking data quality in terms of completeness of the data, range of data and cleanliness of the data
# Running descriptive statistics of the data
df.describe()
year | wine | beer | vodka | champagne | brandy | |
---|---|---|---|---|---|---|
count | 1615.000000 | 1552.000000 | 1557.000000 | 1554.000000 | 1552.000000 | 1549.000000 |
mean | 2007.000000 | 5.628144 | 51.260148 | 11.818694 | 1.313177 | 0.526998 |
std | 5.478922 | 2.813208 | 25.372821 | 5.128806 | 0.797956 | 0.400201 |
min | 1998.000000 | 0.100000 | 0.400000 | 0.050000 | 0.100000 | 0.000000 |
25% | 2002.000000 | 3.575000 | 32.400000 | 8.300000 | 0.800000 | 0.200000 |
50% | 2007.000000 | 5.400000 | 49.970000 | 11.500000 | 1.200000 | 0.400000 |
75% | 2012.000000 | 7.377500 | 67.400000 | 15.000000 | 1.665000 | 0.700000 |
max | 2016.000000 | 18.100000 | 207.300000 | 40.600000 | 5.560000 | 2.300000 |
Clearly there are some missing data as the count is not uniform across all the columns. Also the minimum value is not 0, except for brandy, which may suggest that missing values are not represented by 0.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1615 entries, 0 to 1614 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 year 1615 non-null int64 1 region 1615 non-null object 2 wine 1552 non-null float64 3 beer 1557 non-null float64 4 vodka 1554 non-null float64 5 champagne 1552 non-null float64 6 brandy 1549 non-null float64 dtypes: float64(5), int64(1), object(1) memory usage: 88.4+ KB
Most of the data is in corect data type, except the year, I will convert it to date format
df['year']=pd.to_datetime(df['year'], format='%Y', errors='coerce').dt.year
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1615 entries, 0 to 1614 Data columns (total 7 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 year 1615 non-null int64 1 region 1615 non-null object 2 wine 1552 non-null float64 3 beer 1557 non-null float64 4 vodka 1554 non-null float64 5 champagne 1552 non-null float64 6 brandy 1549 non-null float64 dtypes: float64(5), int64(1), object(1) memory usage: 88.4+ KB
Use of dt.year, gives the type as int, since the data is for particular year, it should be fine.
#Missing Data
df.isna().sum()
year 0 region 0 wine 63 beer 58 vodka 61 champagne 63 brandy 66 dtype: int64
Count of missing data is almost the same, need to check if there is any pattern around it.
msno.matrix(df)
plt.show()
Looks like the missing numbers are following a pattern. Sorting the data may help us in our quest to find some pattern
df.sort_values(['wine','beer','vodka','champagne','brandy'],inplace=True)
msno.matrix(df)
plt.show()
Looks like data is not missing at random
beverages = ['wine','beer','vodka','champagne','brandy']
for beverage in beverages:
print('-----------{}------------'.format(beverage))
print(df[df[beverage].isna()].value_counts('region'))
-----------wine------------ region Chechen Republic 19 Republic of Crimea 16 Sevastopol 16 Republic of Ingushetia 12 dtype: int64 -----------beer------------ region Chechen Republic 16 Republic of Crimea 16 Sevastopol 16 Republic of Ingushetia 10 dtype: int64 -----------vodka------------ region Chechen Republic 19 Republic of Crimea 16 Sevastopol 16 Republic of Ingushetia 10 dtype: int64 -----------champagne------------ region Chechen Republic 19 Republic of Crimea 16 Sevastopol 16 Republic of Ingushetia 12 dtype: int64 -----------brandy------------ region Chechen Republic 19 Republic of Crimea 16 Sevastopol 16 Republic of Ingushetia 15 dtype: int64
Clearly the Chechen Republic, Republic of Crimea, Sevastopol and Republic of Ingushetia are the regions with missing values. Which may point to the fact that these regions may have some political instability, making it difficult to collect data or some may have been annexed to Russia in the recent years. Plotting the year over year trend
# converting the data to long form for easier visulaization
df_long = pd.melt(df,id_vars=['year','region'],value_vars=beverages,var_name='beverage',value_name='sales_per_capita')
sns.relplot(x='year',y='sales_per_capita',data=df_long,kind='line',hue='beverage',col='region',col_wrap=3)
plt.xticks(rotation=45)
plt.ylabel('Trend Over the Year',y=.99)
plt.xlabel('')
plt.show()
Clearly for the four regions with missing data, the data was not collected before 2014. Any imputaion will add noise to the data, so for our models I will delete those records with missing data.
Next, step is to look for any outliers in the data
plt.figure(figsize=(15,8))
sns.boxplot(x='beverage',y='sales_per_capita', data = df_long)
plt.xticks(rotation=45)
plt.ylabel('Distribution of Sales Per Capita')
plt.xlabel('Beverage')
plt.title('Distribution of Data')
plt.show()
Beer has some outliers
df.describe()
year | wine | beer | vodka | champagne | brandy | |
---|---|---|---|---|---|---|
count | 1615.000000 | 1552.000000 | 1557.000000 | 1554.000000 | 1552.000000 | 1549.000000 |
mean | 2007.000000 | 5.628144 | 51.260148 | 11.818694 | 1.313177 | 0.526998 |
std | 5.478922 | 2.813208 | 25.372821 | 5.128806 | 0.797956 | 0.400201 |
min | 1998.000000 | 0.100000 | 0.400000 | 0.050000 | 0.100000 | 0.000000 |
25% | 2002.000000 | 3.575000 | 32.400000 | 8.300000 | 0.800000 | 0.200000 |
50% | 2007.000000 | 5.400000 | 49.970000 | 11.500000 | 1.200000 | 0.400000 |
75% | 2012.000000 | 7.377500 | 67.400000 | 15.000000 | 1.665000 | 0.700000 |
max | 2016.000000 | 18.100000 | 207.300000 | 40.600000 | 5.560000 | 2.300000 |
# Beer Oulier: 67.4 + 1.5*(67.4-32.4)
# Wine Outlier: 7.4 + 1.5 * (7.4 - 3.6)
# Vodka Outlier: 15 + 1.5 * (15 - 8.3)
df[df['beer']>119.9]
year | region | wine | beer | vodka | champagne | brandy | |
---|---|---|---|---|---|---|---|
896 | 2008 | Omsk Oblast | 4.80 | 126.30 | 8.70 | 1.00 | 0.30 |
1151 | 2011 | Omsk Oblast | 5.02 | 126.93 | 6.78 | 1.28 | 0.44 |
1236 | 2012 | Omsk Oblast | 5.10 | 125.20 | 7.00 | 1.40 | 0.50 |
1324 | 2013 | Penza Oblast | 6.30 | 128.50 | 6.00 | 1.00 | 0.40 |
739 | 2006 | Saint Petersburg | 7.60 | 125.30 | 16.40 | 2.40 | 1.40 |
994 | 2009 | Saint Petersburg | 7.70 | 143.00 | 9.40 | 2.40 | 1.30 |
909 | 2008 | Saint Petersburg | 10.40 | 125.90 | 12.50 | 3.60 | 2.10 |
928 | 2008 | Chelyabinsk Oblast | 12.40 | 129.40 | 15.30 | 1.70 | 0.60 |
1013 | 2009 | Chelyabinsk Oblast | 12.70 | 120.80 | 15.20 | 1.80 | 0.60 |
695 | 2006 | Zabaykalsky Krai | 13.70 | 207.30 | 31.70 | 3.10 | 1.30 |
525 | 2004 | Zabaykalsky Krai | 15.70 | 190.40 | 40.60 | 3.00 | 1.80 |
610 | 2005 | Zabaykalsky Krai | 16.30 | 202.10 | 31.00 | 3.00 | 1.20 |
df[df['wine']>13.1]
year | region | wine | beer | vodka | champagne | brandy | |
---|---|---|---|---|---|---|---|
962 | 2009 | Kirov Oblast | 13.40 | 77.00 | 14.20 | 1.40 | 0.50 |
450 | 2003 | Republic of Karelia | 13.50 | 31.10 | 14.20 | 0.60 | 0.20 |
477 | 2003 | Pskov Oblast | 13.50 | 52.90 | 10.90 | 0.60 | 0.10 |
960 | 2009 | Republic of Karelia | 13.60 | 52.40 | 16.40 | 1.40 | 1.00 |
1072 | 2010 | Pskov Oblast | 13.60 | 58.10 | 15.10 | 2.00 | 0.90 |
695 | 2006 | Zabaykalsky Krai | 13.70 | 207.30 | 31.70 | 3.10 | 1.30 |
1149 | 2011 | Novgorod Oblast | 13.71 | 49.97 | 14.89 | 2.51 | 1.06 |
521 | 2004 | Vologda Oblast | 14.20 | 85.10 | 26.70 | 0.60 | 0.70 |
875 | 2008 | Republic of Karelia | 14.30 | 50.00 | 17.40 | 1.40 | 0.80 |
894 | 2008 | Novgorod Oblast | 14.50 | 54.60 | 14.40 | 1.80 | 0.80 |
705 | 2006 | Republic of Karelia | 14.70 | 40.90 | 15.00 | 0.80 | 0.50 |
979 | 2009 | Novgorod Oblast | 14.80 | 49.10 | 14.00 | 1.80 | 0.80 |
1064 | 2010 | Novgorod Oblast | 14.80 | 49.60 | 14.70 | 1.90 | 0.90 |
620 | 2005 | Republic of Karelia | 15.00 | 36.20 | 14.90 | 0.70 | 0.50 |
525 | 2004 | Zabaykalsky Krai | 15.70 | 190.40 | 40.60 | 3.00 | 1.80 |
610 | 2005 | Zabaykalsky Krai | 16.30 | 202.10 | 31.00 | 3.00 | 1.20 |
535 | 2004 | Republic of Karelia | 16.90 | 33.20 | 14.00 | 0.70 | 0.50 |
562 | 2004 | Pskov Oblast | 17.50 | 67.90 | 10.20 | 0.70 | 0.20 |
436 | 2003 | Vologda Oblast | 18.10 | 61.70 | 24.60 | 0.40 | 0.60 |
df[df['vodka']>25]
year | region | wine | beer | vodka | champagne | brandy | |
---|---|---|---|---|---|---|---|
224 | 2000 | Komi Republic | 4.6 | 35.1 | 27.4 | 1.9 | 0.4 |
309 | 2001 | Komi Republic | 5.6 | 37.6 | 28.5 | 0.8 | 0.5 |
139 | 1999 | Komi Republic | 5.9 | 35.7 | 25.9 | 2.5 | 0.6 |
23 | 1998 | Kamchatka Krai | 6.4 | 25.5 | 25.1 | 0.8 | 0.4 |
108 | 1999 | Kamchatka Krai | 6.5 | 30.6 | 26.4 | 0.9 | 0.4 |
193 | 2000 | Kamchatka Krai | 7.1 | 32.7 | 26.6 | 0.8 | 0.4 |
363 | 2002 | Kamchatka Krai | 7.3 | 35.3 | 25.9 | 1.2 | 0.5 |
278 | 2001 | Kamchatka Krai | 7.4 | 33.6 | 26.7 | 0.9 | 0.4 |
448 | 2003 | Kamchatka Krai | 7.4 | 39.1 | 25.1 | 1.8 | 1.2 |
394 | 2002 | Komi Republic | 7.6 | 33.9 | 28.4 | 0.9 | 0.5 |
343 | 2002 | Arkhangelsk Oblast | 8.1 | 76.1 | 29.5 | 1.5 | 1.2 |
382 | 2002 | Nenets Autonomous Okrug | 8.1 | 76.1 | 29.5 | 1.5 | 1.2 |
479 | 2003 | Komi Republic | 8.3 | 44.4 | 29.7 | 0.8 | 0.5 |
405 | 2002 | Smolensk Oblast | 8.7 | 96.2 | 25.4 | 1.4 | 0.2 |
490 | 2003 | Smolensk Oblast | 8.8 | 92.3 | 25.6 | 1.5 | 0.3 |
428 | 2003 | Arkhangelsk Oblast | 9.2 | 109.0 | 31.9 | 1.7 | 1.3 |
467 | 2003 | Nenets Autonomous Okrug | 9.2 | 109.0 | 31.9 | 1.7 | 1.3 |
649 | 2005 | Komi Republic | 9.6 | 70.7 | 26.7 | 1.2 | 0.6 |
564 | 2004 | Komi Republic | 10.3 | 76.2 | 28.1 | 1.1 | 0.5 |
695 | 2006 | Zabaykalsky Krai | 13.7 | 207.3 | 31.7 | 3.1 | 1.3 |
521 | 2004 | Vologda Oblast | 14.2 | 85.1 | 26.7 | 0.6 | 0.7 |
525 | 2004 | Zabaykalsky Krai | 15.7 | 190.4 | 40.6 | 3.0 | 1.8 |
610 | 2005 | Zabaykalsky Krai | 16.3 | 202.1 | 31.0 | 3.0 | 1.2 |
Three biggest outliers for the three beverages are from the same region: Zabaykalsky and are over a period of three years, 2004, 05 and 06
df[df['region']=='Zabaykalsky Krai']
year | region | wine | beer | vodka | champagne | brandy | |
---|---|---|---|---|---|---|---|
15 | 1998 | Zabaykalsky Krai | 0.90 | 8.60 | 5.00 | 0.20 | 0.10 |
100 | 1999 | Zabaykalsky Krai | 1.10 | 11.50 | 5.80 | 0.20 | 0.10 |
185 | 2000 | Zabaykalsky Krai | 1.90 | 16.40 | 7.00 | 0.30 | 0.10 |
270 | 2001 | Zabaykalsky Krai | 3.50 | 23.80 | 8.60 | 0.40 | 0.10 |
355 | 2002 | Zabaykalsky Krai | 4.70 | 31.30 | 10.90 | 0.50 | 0.10 |
440 | 2003 | Zabaykalsky Krai | 5.90 | 38.10 | 12.40 | 0.60 | 0.10 |
1545 | 2016 | Zabaykalsky Krai | 6.40 | 30.80 | 6.80 | 0.90 | 0.30 |
1290 | 2013 | Zabaykalsky Krai | 6.40 | 64.50 | 12.70 | 1.10 | 0.40 |
1035 | 2010 | Zabaykalsky Krai | 6.40 | 65.30 | 13.40 | 1.00 | 0.40 |
950 | 2009 | Zabaykalsky Krai | 6.50 | 66.80 | 13.80 | 1.00 | 0.40 |
1120 | 2011 | Zabaykalsky Krai | 6.52 | 66.83 | 13.81 | 1.07 | 0.45 |
1460 | 2015 | Zabaykalsky Krai | 6.60 | 31.90 | 7.00 | 0.90 | 0.30 |
1375 | 2014 | Zabaykalsky Krai | 6.60 | 38.80 | 9.20 | 1.10 | 0.30 |
1205 | 2012 | Zabaykalsky Krai | 6.60 | 68.00 | 13.60 | 1.10 | 0.40 |
780 | 2007 | Zabaykalsky Krai | 6.70 | 61.90 | 13.50 | 0.80 | 0.20 |
865 | 2008 | Zabaykalsky Krai | 6.90 | 64.10 | 13.70 | 0.90 | 0.30 |
695 | 2006 | Zabaykalsky Krai | 13.70 | 207.30 | 31.70 | 3.10 | 1.30 |
525 | 2004 | Zabaykalsky Krai | 15.70 | 190.40 | 40.60 | 3.00 | 1.80 |
610 | 2005 | Zabaykalsky Krai | 16.30 | 202.10 | 31.00 | 3.00 | 1.20 |
For our models we will drop the data for 2004, 2005 and 2006 for Zabaykalsky Krai region, Also clearly beer is the most popular beverage in Russia
outlier_index = [695,525,610]
plt.figure(figsize=(15,8))
sns.set_palette('tab10')
sns.barplot(data=df_long,x='beverage',y='sales_per_capita')
plt.title('Most Popular Beverage in Russia',loc='center')
plt.xticks(rotation=45)
plt.ylabel('Sales Per Capita')
plt.xlabel('')
plt.show()
Since the data is collected over a period of time, trend of beverage cosumption in Russia may be interseting:
plt.figure(figsize=(15,8))
sns.lineplot(x='year',y='sales_per_capita',data=df_long,hue='beverage')
plt.xticks(rotation=45)
plt.ylabel('Sales Per Capita')
plt.xlabel('Year')
plt.title('Trend over the Years')
plt.show()
There is overall downward trend in beer consumption in Russia, although it is still the most consummed alcoholic beverage.
Also since, our firm had launched a successful campaign in St. Petersburg it would be good idea to have a closer look at it
plt.figure(figsize=(15,8))
sns.lineplot(x='year',y='sales_per_capita',data=df_long[df_long['region']=='Saint Petersburg'],hue='beverage')
plt.title('St. Petersburg Trend over the years')
plt.show()
Clearly wine consumption in recent year has gone up relatively while beer is consumed relatively lesser during last few years in St. Petersburg.
Final Exploratory Data Analysis I will be to do a correlation Analysis, to determine if there are highly correlated data.
plt.figure(figsize=(15,5))
sns.heatmap(df.corr(), annot=True)
plt.show()
Clearly champagne and brandy are highly correlated and wine and brandy are second most correlated alcoholic beverages.
High correlation between Brandy and Champagne can be handled by some feature engineering, like add those two columns together. Another way would be to use decomposition methods like PCA
From our exploratory data analysis, most of the data that were missing were either because data was not collected before a certain period. For our model we will drop those missing records
df_processed = df.dropna(how='any',axis=0)
Droping the rows with outliers
df_processed = df_processed.drop(outlier_index)
Will add the columns of brandy and champagne
df_processed['brandy_champagne'] = df_processed['brandy'] + df_processed['champagne']
df_processed.drop(['brandy','champagne'],axis=1, inplace=True)
df_processed.describe()
year | wine | beer | vodka | brandy_champagne | |
---|---|---|---|---|---|
count | 1546.000000 | 1546.000000 | 1546.000000 | 1546.000000 | 1546.000000 |
mean | 2007.023286 | 5.619909 | 51.229204 | 11.811125 | 1.837432 |
std | 5.493695 | 2.776228 | 24.341493 | 5.000204 | 1.131918 |
min | 1998.000000 | 0.100000 | 1.000000 | 0.400000 | 0.140000 |
25% | 2002.000000 | 3.600000 | 32.600000 | 8.340000 | 1.000000 |
50% | 2007.000000 | 5.400000 | 50.050000 | 11.500000 | 1.600000 |
75% | 2012.000000 | 7.307500 | 67.400000 | 15.000000 | 2.300000 |
max | 2016.000000 | 18.100000 | 143.000000 | 31.900000 | 7.310000 |
beverages_updated = ['wine', 'beer', 'vodka', 'brandy_champagne']
Plotting a pairplot to see the relationship between the variables and thier individual distribution
sns.pairplot(df_processed[beverages_updated])
plt.show()
Data is in good shape to build the first KMean model, for the model, year and reggions will be dropped
df_processed_model = df_processed.drop(['year','region'], axis=1)
Standard scaler is used to make the data within a similar range
scaler = StandardScaler()
scaler.fit(df_processed_model)
scaled_transformed_data = scaler.transform(df_processed_model)
Plotting an elbow plot to determine the number of clusters
inertia = []
for i in range(2,9):
kmean = KMeans(n_clusters=i, random_state=42)
kmean.fit(scaled_transformed_data)
inertia.append(kmean.inertia_)
plt.plot(range(2,9),inertia)
plt.title('Elbow Plot')
plt.xlabel('Number of Cluster')
plt.ylabel('Inertia')
plt.show()
We don't see a clear elbow, it's between 3 and 5, so we will consider the elbow in between 3 and 5 i.e. 4
kmean = KMeans(n_clusters = 4, random_state = 42)
kmean.fit(scaled_transformed_data)
KMeans(n_clusters=4, random_state=42)
df_processed['Cluster'] = kmean.labels_
df_processed[df_processed['region']=='Saint Petersburg']
year | region | wine | beer | vodka | brandy_champagne | Cluster | |
---|---|---|---|---|---|---|---|
144 | 1999 | Saint Petersburg | 2.6 | 57.40 | 13.00 | 2.30 | 0 |
59 | 1998 | Saint Petersburg | 2.7 | 27.90 | 12.30 | 1.70 | 1 |
229 | 2000 | Saint Petersburg | 4.4 | 68.20 | 14.70 | 2.90 | 3 |
569 | 2004 | Saint Petersburg | 4.7 | 103.90 | 13.90 | 3.30 | 2 |
314 | 2001 | Saint Petersburg | 6.2 | 101.00 | 15.50 | 3.20 | 2 |
399 | 2002 | Saint Petersburg | 6.3 | 104.60 | 17.20 | 3.50 | 2 |
484 | 2003 | Saint Petersburg | 6.6 | 105.60 | 14.30 | 3.80 | 2 |
1334 | 2013 | Saint Petersburg | 6.8 | 44.20 | 7.20 | 3.60 | 3 |
1164 | 2011 | Saint Petersburg | 6.9 | 82.93 | 9.01 | 3.84 | 2 |
1589 | 2016 | Saint Petersburg | 7.1 | 33.70 | 6.30 | 3.10 | 3 |
739 | 2006 | Saint Petersburg | 7.6 | 125.30 | 16.40 | 3.80 | 2 |
994 | 2009 | Saint Petersburg | 7.7 | 143.00 | 9.40 | 3.70 | 2 |
1249 | 2012 | Saint Petersburg | 8.0 | 77.20 | 8.60 | 4.10 | 2 |
1504 | 2015 | Saint Petersburg | 8.1 | 36.10 | 6.90 | 3.60 | 3 |
1419 | 2014 | Saint Petersburg | 8.2 | 41.80 | 7.70 | 4.20 | 2 |
1079 | 2010 | Saint Petersburg | 8.4 | 113.20 | 9.50 | 4.20 | 2 |
824 | 2007 | Saint Petersburg | 9.1 | 106.40 | 14.00 | 4.10 | 2 |
654 | 2005 | Saint Petersburg | 10.3 | 104.80 | 16.20 | 3.50 | 2 |
909 | 2008 | Saint Petersburg | 10.4 | 125.90 | 12.50 | 5.70 | 2 |
Although most of the St. Petersburg data is cluster areound 2 the most recent years are clustered at 3. So for finding similar regions will use cluster 3
St_Peterburrgs=df_processed['region'][df_processed['Cluster']==3].unique()
Implementing a hierarchial cluster to check how many clusters it provides
Z = linkage(scaled_transformed_data, 'ward')
plt.figure(figsize=(15,5))
dn = dendrogram(Z, truncate_mode = 'lastp')
plt.show()
Both KMeans and Hierarchial clustering seems to provide similar number of clusters
First will determine how many components of PCA defines the model
pca = PCA()
pca.fit_transform(scaled_transformed_data)
array([[-3.46956253, -0.72521644, 0.66959288, -0.54640255], [-2.46524707, 0.00819039, 0.30845386, -0.97287768], [-3.15803042, -0.31920428, 0.65464116, -0.62336092], ..., [ 1.71286378, 0.31806945, 1.87697617, 3.31954796], [ 2.18285413, -0.87862342, 0.89080792, 3.64043531], [ 3.15344991, 1.84315533, 0.63818022, 3.70826122]])
explained_variance = pca.explained_variance_ratio_
plt.bar(range(0,len(explained_variance)),explained_variance)
plt.xlabel('N Components')
plt.ylabel('Variance Explained')
plt.title('PCA')
plt.show()
Cumsum_variance = np.cumsum(explained_variance)
plt.plot(np.arange(0,len(Cumsum_variance)),Cumsum_variance)
plt.xlabel('Explained Variance')
plt.ylabel('Cummulative Explained Variance')
plt.title('Number of PCA Components')
plt.show()
Clearly three components explains all the variance in data
pca3 = PCA(n_components=3, random_state=42)
pca3_transformed_data=pca3.fit_transform(scaled_transformed_data)
Using the PCA decomposed data for the KMeans
inertia= []
for i in range(2,9):
kmean = KMeans(n_clusters=i, random_state=42)
kmean.fit(pca3_transformed_data)
inertia.append(kmean.inertia_)
plt.plot(range(2,9),inertia)
plt.show()
This also show 4 clusters
kmean_PCA = KMeans(n_clusters = 4, random_state = 42)
kmean_PCA.fit(pca3_transformed_data)
KMeans(n_clusters=4, random_state=42)
df_processed['Cluster_PCA'] = kmean_PCA.labels_
df_processed[['region','Cluster','Cluster_PCA']]
region | Cluster | Cluster_PCA | |
---|---|---|---|
1543 | Republic of Dagestan | 1 | 3 |
489 | Republic of North Ossetia-Alania | 1 | 3 |
81 | Chukotka Autonomous Okrug | 1 | 3 |
1458 | Republic of Dagestan | 1 | 3 |
1373 | Republic of Dagestan | 1 | 3 |
... | ... | ... | ... |
1064 | Novgorod Oblast | 2 | 2 |
620 | Republic of Karelia | 2 | 2 |
535 | Republic of Karelia | 2 | 2 |
562 | Pskov Oblast | 2 | 2 |
436 | Vologda Oblast | 2 | 2 |
1546 rows × 3 columns
df_processed[df_processed['region']=='Saint Petersburg']
year | region | wine | beer | vodka | brandy_champagne | Cluster | Cluster_PCA | |
---|---|---|---|---|---|---|---|---|
144 | 1999 | Saint Petersburg | 2.6 | 57.40 | 13.00 | 2.30 | 0 | 1 |
59 | 1998 | Saint Petersburg | 2.7 | 27.90 | 12.30 | 1.70 | 1 | 3 |
229 | 2000 | Saint Petersburg | 4.4 | 68.20 | 14.70 | 2.90 | 3 | 0 |
569 | 2004 | Saint Petersburg | 4.7 | 103.90 | 13.90 | 3.30 | 2 | 2 |
314 | 2001 | Saint Petersburg | 6.2 | 101.00 | 15.50 | 3.20 | 2 | 2 |
399 | 2002 | Saint Petersburg | 6.3 | 104.60 | 17.20 | 3.50 | 2 | 2 |
484 | 2003 | Saint Petersburg | 6.6 | 105.60 | 14.30 | 3.80 | 2 | 2 |
1334 | 2013 | Saint Petersburg | 6.8 | 44.20 | 7.20 | 3.60 | 3 | 0 |
1164 | 2011 | Saint Petersburg | 6.9 | 82.93 | 9.01 | 3.84 | 2 | 2 |
1589 | 2016 | Saint Petersburg | 7.1 | 33.70 | 6.30 | 3.10 | 3 | 0 |
739 | 2006 | Saint Petersburg | 7.6 | 125.30 | 16.40 | 3.80 | 2 | 2 |
994 | 2009 | Saint Petersburg | 7.7 | 143.00 | 9.40 | 3.70 | 2 | 2 |
1249 | 2012 | Saint Petersburg | 8.0 | 77.20 | 8.60 | 4.10 | 2 | 2 |
1504 | 2015 | Saint Petersburg | 8.1 | 36.10 | 6.90 | 3.60 | 3 | 0 |
1419 | 2014 | Saint Petersburg | 8.2 | 41.80 | 7.70 | 4.20 | 2 | 2 |
1079 | 2010 | Saint Petersburg | 8.4 | 113.20 | 9.50 | 4.20 | 2 | 2 |
824 | 2007 | Saint Petersburg | 9.1 | 106.40 | 14.00 | 4.10 | 2 | 2 |
654 | 2005 | Saint Petersburg | 10.3 | 104.80 | 16.20 | 3.50 | 2 | 2 |
909 | 2008 | Saint Petersburg | 10.4 | 125.90 | 12.50 | 5.70 | 2 | 2 |
As in the orignal KMean clustering, for the PCA version, will use cluster 0 as it corresponds to latest data
St_Peterburrgs_PCA=df_processed['region'][df_processed['Cluster_PCA']==0].unique()
Inorder to find 10 regions similar to St. Petersburg, will use the intersection of cluster of St. Petersburg from orignal KMeans clustering and one obtained from PCA followed by Kmean Clustering
Common_Regions = [*set(St_Peterburrgs).intersection(set(St_Peterburrgs_PCA))]
df_processed_sub=df_processed[df_processed['region'].isin(Common_Regions) & (df_processed['Cluster']==3) & (df_processed['Cluster_PCA']==0)]
df_processed_sub.sort_values('year', inplace =True)
Filtering the data by most recent year, will give us the regions with most current consumer behaviour
df_processed_sub[['region','year','wine']][df_processed_sub['year']==2016]
region | year | wine | |
---|---|---|---|
1532 | Amur Oblast | 2016 | 6.2 |
1588 | Samara Oblast | 2016 | 6.2 |
1581 | Primorsky Krai | 2016 | 6.8 |
1596 | Stavropol Krai | 2016 | 3.8 |
1574 | Novgorod Oblast | 2016 | 10.6 |
1587 | Ryazan Oblast | 2016 | 6.3 |
1598 | Republic of Tatarstan | 2016 | 4.4 |
1573 | Nizhny Novgorod Oblast | 2016 | 6.3 |
1614 | Yaroslavl Oblast | 2016 | 10.2 |
1604 | Ulyanovsk Oblast | 2016 | 6.6 |
1558 | Kostroma Oblast | 2016 | 7.2 |
1606 | Republic of Khakassia | 2016 | 4.3 |
1582 | Pskov Oblast | 2016 | 10.4 |
1575 | Novosibirsk Oblast | 2016 | 6.7 |
1607 | Khanty–Mansi Autonomous Okrug – Yugra | 2016 | 4.3 |
1552 | Kaluga Oblast | 2016 | 7.5 |
1612 | Sakha (Yakutia) Republic | 2016 | 4.3 |
1578 | Oryol Oblast | 2016 | 6.4 |
1557 | Kirov Oblast | 2016 | 7.3 |
1586 | Rostov Oblast | 2016 | 4.2 |
1561 | Republic of Crimea | 2016 | 4.4 |
1540 | Volgograd Oblast | 2016 | 5.5 |
1539 | Vladimir Oblast | 2016 | 8.8 |
1562 | Kurgan Oblast | 2016 | 4.6 |
1589 | Saint Petersburg | 2016 | 7.1 |
1559 | Krasnodar Krai | 2016 | 5.3 |
1546 | Ivanovo Oblast | 2016 | 8.2 |
1560 | Krasnoyarsk Krai | 2016 | 5.3 |
1601 | Tula Oblast | 2016 | 5.5 |
1563 | Kursk Oblast | 2016 | 5.3 |
1608 | Chelyabinsk Oblast | 2016 | 5.3 |
1542 | Voronezh Oblast | 2016 | 5.5 |
1567 | Mari El Republic | 2016 | 8.0 |
1580 | Perm Krai | 2016 | 5.5 |
1602 | Tyumen Oblast | 2016 | 5.1 |
1603 | Udmurt Republic | 2016 | 5.5 |
1613 | Yamalo-Nenets Autonomous Okrug | 2016 | 4.5 |
1544 | Jewish Autonomous Oblast | 2016 | 5.5 |
1599 | Tver Oblast | 2016 | 7.8 |
1600 | Tomsk Oblast | 2016 | 5.1 |
1541 | Vologda Oblast | 2016 | 8.6 |
1550 | Kaliningrad Oblast | 2016 | 8.6 |
1576 | Omsk Oblast | 2016 | 5.0 |
1579 | Penza Oblast | 2016 | 5.9 |
1592 | Sverdlovsk Oblast | 2016 | 7.6 |
1556 | Kemerovo Oblast | 2016 | 4.9 |
1564 | Leningrad Oblast | 2016 | 8.1 |
1597 | Tambov Oblast | 2016 | 4.8 |
1548 | Irkutsk Oblast | 2016 | 6.1 |
1565 | Lipetsk Oblast | 2016 | 6.2 |
1595 | Smolensk Oblast | 2016 | 7.9 |
1593 | Sevastopol | 2016 | 5.4 |
From the shorlisted 52 regions will select ten based on the wine consumtion per capita, which is similar to St. Petersburg
Final_regions = df_processed_sub[['region']][(df_processed_sub['year']==2016) & (df_processed_sub['wine']>7.8)]
Based on the cluster analysis of St. Petersburg and wine consumption pattern following are the ten Regions to run the marketing campaign
Final_regions
region | |
---|---|
1574 | Novgorod Oblast |
1614 | Yaroslavl Oblast |
1582 | Pskov Oblast |
1539 | Vladimir Oblast |
1546 | Ivanovo Oblast |
1567 | Mari El Republic |
1541 | Vologda Oblast |
1550 | Kaliningrad Oblast |
1564 | Leningrad Oblast |
1595 | Smolensk Oblast |